After you've looked at the data you're working with and, in this case, know the shapes of the images and of the keypoints, you are ready to define a convolutional neural network that can learn from this data.
In this notebook and in models.py, you will:
* What does well mean?
"Well" means that the model's loss decreases during training and, when applied to test image data, the model produces keypoints that closely match the true keypoints of each face. And you'll see examples of this later in the notebook.
Recall that CNNs are defined by a few types of layers:
You are required to use the above layers and encouraged to add multiple convolutional layers and things like dropout layers that may prevent overfitting. You are also encouraged to look at literature on keypoint detection, such as this paper, to help you determine the structure of your network.
models.py file
This file is mostly empty but contains the expected name and some TODOs for creating your model.
To define a neural network in PyTorch, you define the layers of a model in the function __init__ and define the feedforward behavior of a network that employs those initialized layers in the function forward, which takes in an input image tensor, x. The structure of this Net class is shown below and left for you to fill in.
Note: During training, PyTorch will be able to perform backpropagation by keeping track of the network's feedforward behavior and using autograd to calculate the update to the weights in the network.
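As a rough sketch of the `__init__`/`forward` split described above, a small `Net` might look like the following. The layer counts, channel sizes, and dropout probability are illustrative assumptions rather than the architecture expected in models.py; the only fixed points are a single-channel 224x224 input and 136 outputs (68 keypoints, each with an x and a y value).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    """Illustrative skeleton only; all layer sizes here are assumptions."""

    def __init__(self):
        super().__init__()
        # layers with trainable weights (or their own state) live in __init__
        self.conv1 = nn.Conv2d(1, 32, 3)   # 1x224x224 -> 32x222x222
        self.conv2 = nn.Conv2d(32, 64, 3)  # 32x111x111 -> 64x109x109
        self.pool = nn.MaxPool2d(2, 2)     # halves height and width
        self.drop = nn.Dropout(p=0.3)      # hypothetical dropout rate
        # after two conv/pool stages the feature maps are 64x54x54
        self.fc1 = nn.Linear(64 * 54 * 54, 136)

    def forward(self, x):
        # fixed functions such as ReLU appear only in forward
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)          # flatten to (batch, features)
        x = self.drop(x)
        return self.fc1(x)


net = Net()
out = net(torch.zeros(2, 1, 224, 224))     # a dummy batch of two images
print(out.shape)
```

A real model for this task would likely need more layers than this; the sketch only shows where each kind of layer goes.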
__init__
As a reminder, a conv/pool layer may be defined like this (in __init__):
# 1 input image channel (for grayscale images), 32 output channels/feature maps, 3x3 square convolution kernel
self.conv1 = nn.Conv2d(1, 32, 3)
# maxpool that uses a square window of kernel_size=2, stride=2
self.pool = nn.MaxPool2d(2, 2)
forward
These layers are then referred to in the forward function like this, in which the conv1 layer has a ReLU activation applied to it before max pooling is applied:
x = self.pool(F.relu(self.conv1(x)))
Best practice is to place any layers whose weights will change during the training process in __init__ and refer to them in the forward function; any layers or functions that always behave in the same way, such as a pre-defined activation function, should appear only in the forward function.
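When a linear layer follows conv/pool layers like the ones above, its input size depends on how much those layers shrink the image. The helper below is a hypothetical convenience function (not part of PyTorch) that applies the standard output-size formula for an unpadded, undilated square window, assuming a 224x224 input.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a square conv or pooling window (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 224                                              # assumed input height/width
size = conv_output_size(size, kernel_size=3)            # nn.Conv2d(1, 32, 3)
conv_size = size
size = conv_output_size(size, kernel_size=2, stride=2)  # nn.MaxPool2d(2, 2)

# number of inputs a linear layer would need right after this conv/pool pair
flat_features = 32 * size * size
print(conv_size, size, flat_features)
```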
You are tasked with defining the network in the models.py file so that any models you define can be saved and loaded by name in different notebooks in this project directory. For example, by defining a CNN class called Net in models.py, you can then create that same architecture in this and other notebooks by simply importing the class and instantiating a model:
from models import Net
net = Net()
# load the data if you need to; if you have already loaded the data, you may comment this cell out
# -- DO NOT CHANGE THIS CELL -- #
!mkdir /data
!wget -P /data/ https://s3.amazonaws.com/video.udacity-data.com/topher/2018/May/5aea1b91_train-test-data/train-test-data.zip
!unzip -n /data/train-test-data.zip -d /data
from workspace_utils import active_session
with active_session():
    train_model(num_epochs)
# import the usual resources
import matplotlib.pyplot as plt
import numpy as np
# import utilities to keep workspaces alive during model training
from workspace_utils import active_session
# watch for any changes in models.py; if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2
## TODO: Define the Net in models.py
import torch
import torch.nn as nn
import torch.nn.functional as F
## TODO: Once you've defined the network, you can instantiate it
# one example conv layer has been provided for you
from models import Net
net = Net()
print(net)
To prepare for training, create a transformed dataset of images and keypoints.
In PyTorch, a convolutional neural network expects a torch image of a consistent size as input. For efficient training, and so your model's loss does not blow up during training, it is also suggested that you normalize the input images and keypoints. The necessary transforms have been defined in data_load.py and you do not need to modify these; take a look at this file (you'll see the same transforms that were defined and applied in Notebook 1).
To define the data transform below, use a composition of:
These transformations have been defined in data_load.py, but it's up to you to call them and create a data_transform below. This transform will be applied to the training data and, later, the test data. It will change how you go about displaying these images and keypoints, but these steps are essential for efficient training.
As a note, should you want to perform data augmentation (which is optional in this project), and randomly rotate or shift these images, a square image size will be useful; rotating a 224x224 image by 90 degrees will result in the same shape of output.
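The point about square images can be checked with a small NumPy sketch. The 200x224 array is a made-up non-square example, and the sketch only rotates the image array; a real augmentation would also have to rotate the keypoints.

```python
import numpy as np

square = np.zeros((224, 224))   # square crop, as produced by RandomCrop(224)
rect = np.zeros((200, 224))     # hypothetical non-square image

rot_square = np.rot90(square)   # rotate by 90 degrees
rot_rect = np.rot90(rect)

# the square image keeps its shape; the rectangle swaps height and width
print(rot_square.shape, rot_rect.shape)
```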
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
# the dataset we created in Notebook 1 is copied in the helper file `data_load.py`
from data_load import FacialKeypointsDataset
# the transforms we defined in Notebook 1 are in the helper file `data_load.py`
from data_load import Rescale, RandomCrop, Normalize, ToTensor
## TODO: define the data_transform using transforms.Compose([all tx's, . , .])
# order matters! i.e. rescaling should come before a smaller crop
data_transform = transforms.Compose([Rescale(250),
                                     RandomCrop(224),
                                     Normalize(),
                                     ToTensor()])
# testing that you've defined a transform
assert(data_transform is not None), 'Define a data_transform'
# create the transformed dataset
transformed_dataset = FacialKeypointsDataset(csv_file='/data/training_frames_keypoints.csv',
                                             root_dir='/data/training/',
                                             transform=data_transform)
print('Number of images: ', len(transformed_dataset))
# iterate through the transformed dataset and print some stats about the first few samples
for i in range(4):
    sample = transformed_dataset[i]
    print(i, sample['image'].size(), sample['keypoints'].size())
    plt.imshow(np.squeeze(sample['image'].data.numpy()), cmap='gray')
    plt.show()
Next, having defined the transformed dataset, we can use PyTorch's DataLoader class to load the training data in batches of any size and to shuffle the data for training the model. You can read more about the parameters of the DataLoader in this documentation.
Decide on a good batch size for training your model. Try both small and large batch sizes and note how the loss decreases as the model trains. Too large a batch size may cause your model to crash and/or run out of memory while training.
Note for Windows users: Please change the num_workers to 0 or you may face some issues with your DataLoader failing.
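To get a feel for how batch size changes the number of weight updates per epoch, here is a minimal sketch that uses a stand-in dataset of 100 zero tensors (an assumption; the real dataset has a different size) and `num_workers=0`, which also sidesteps the Windows issue mentioned above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in data: 100 tiny "images" with 68 (x, y) keypoints each
fake_images = torch.zeros(100, 1, 8, 8)
fake_keypoints = torch.zeros(100, 68, 2)
dataset = TensorDataset(fake_images, fake_keypoints)

batches_per_epoch = {}
for bs in (5, 20, 40):
    loader = DataLoader(dataset, batch_size=bs, shuffle=True, num_workers=0)
    # len(loader) is the number of batches, i.e. optimizer steps, per epoch
    batches_per_epoch[bs] = len(loader)

print(batches_per_epoch)
```

Smaller batches mean more, noisier updates per epoch; larger batches mean fewer updates and more memory per step.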
# load training data in batches
batch_size = 20
train_loader = DataLoader(transformed_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=1)
Take a look at how this model performs before it trains. You should see that the keypoints it predicts start off in one spot and don't match the keypoints on a face at all! It's interesting to visualize this behavior so that you can compare it to the model after training and see how the model has improved.
The test dataset is one that this model has not seen before, meaning it has not trained with these images. We'll load in this test data and, before and after training, see how your model performs on this set!
To visualize this test data, we have to go through some un-transformation steps to turn our images from tensors into NumPy images and to turn our keypoints back into a recognizable range.
# load in the test data, using the dataset class
# AND apply the data_transform you defined above
# create the test dataset
test_dataset = FacialKeypointsDataset(csv_file='/data/test_frames_keypoints.csv',
                                      root_dir='/data/test/',
                                      transform=data_transform)
# load test data in batches
batch_size = 20
test_loader = DataLoader(test_dataset,
                         batch_size=batch_size,
                         shuffle=True,
                         num_workers=1)
To test the model on a test sample of data, you have to follow these steps:
This function tests how the network performs on the first batch of test data. It returns the transformed images, the predicted keypoints (produced by the model), and the ground truth keypoints.
# test the model on a batch of test images
def net_sample_output():
    # iterate through the test dataset
    for i, sample in enumerate(test_loader):
        # get sample data: images and ground truth keypoints
        images = sample['image']
        key_pts = sample['keypoints']
        # convert images to FloatTensors
        images = images.type(torch.FloatTensor)
        # forward pass to get net output
        output_pts = net(images)
        # reshape to batch_size x 68 x 2 pts
        output_pts = output_pts.view(output_pts.size(0), 68, -1)
        # return after the first batch is tested
        if i == 0:
            return images, output_pts, key_pts
If you get a size or dimension error here, make sure that your network outputs the expected number of keypoints! Or if you get a Tensor type error, look into changing the above code that casts the data into float types: images = images.type(torch.FloatTensor).
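The size error mentioned above comes from the `view` call, which can be tried in isolation. This sketch uses zero tensors in place of real network output; the 135-value case is a deliberately wrong, made-up example.

```python
import torch

batch_size = 20

# a correct network output: 136 values per image (68 keypoints * 2 coordinates)
flat_output = torch.zeros(batch_size, 136)
pts = flat_output.view(flat_output.size(0), 68, -1)
print(pts.shape)

# a network with the wrong number of outputs cannot be reshaped to 68 points
bad_output = torch.zeros(batch_size, 135)
try:
    bad_output.view(bad_output.size(0), 68, -1)
    reshape_failed = False
except RuntimeError as err:
    reshape_failed = True
    print('reshape failed:', err)
```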
# call the above function
# returns: test images, test predicted keypoints, test ground truth keypoints
test_images, test_outputs, gt_pts = net_sample_output()
# print out the dimensions of the data to see if they make sense
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.data.size())
Once we've had the model produce some predicted output keypoints, we can visualize these points in a way that's similar to how we've displayed this data before, only this time, we have to "un-transform" the image/keypoint data to display it.
Note that I've defined a new function, show_all_keypoints, that displays a grayscale image, its predicted keypoints, and its ground truth keypoints (if provided).
def show_all_keypoints(image, predicted_key_pts, gt_pts=None):
    """Show image with predicted keypoints"""
    # image is grayscale
    plt.imshow(image, cmap='gray')
    # plot predicted points as magenta pts
    plt.scatter(predicted_key_pts[:, 0], predicted_key_pts[:, 1], s=20, marker='.', c='m')
    # plot ground truth points as green pts
    if gt_pts is not None:
        plt.scatter(gt_pts[:, 0], gt_pts[:, 1], s=20, marker='.', c='g')
Next, you'll see a helper function, visualize_output, that takes in a batch of images, predicted keypoints, and ground truth keypoints and displays a set of those images and their true/predicted keypoints.
This function's main role is to take batches of image and keypoint data (the input and output of your CNN), and transform them into numpy images and un-normalized keypoints (x, y) for normal display. The un-transformation process turns keypoints and images into numpy arrays from Tensors and it undoes the keypoint normalization done in the Normalize() transform; it's assumed that you applied these transformations when you loaded your test data.
# visualize the output
# by default this shows a batch of 10 images
def visualize_output(test_images, test_outputs, gt_pts=None, batch_size=10):
    for i in range(batch_size):
        plt.figure(figsize=(30,30))
        ax = plt.subplot(1, batch_size, i+1)
        # un-transform the image data
        image = test_images[i].data   # get the image from its Variable wrapper
        image = image.numpy()   # convert to numpy array from a Tensor
        image = np.transpose(image, (1, 2, 0))   # transpose to go from torch to numpy image
        # un-transform the predicted key_pts data
        predicted_key_pts = test_outputs[i].data
        predicted_key_pts = predicted_key_pts.numpy()
        # undo normalization of keypoints
        predicted_key_pts = predicted_key_pts*50.0+100
        # plot ground truth points for comparison, if they exist
        ground_truth_pts = None
        if gt_pts is not None:
            ground_truth_pts = gt_pts[i]
            ground_truth_pts = ground_truth_pts*50.0+100
        # call show_all_keypoints
        show_all_keypoints(np.squeeze(image), predicted_key_pts, ground_truth_pts)
        plt.axis('off')
    plt.show()
visualize_output(test_images, test_outputs, gt_pts)
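The keypoint un-normalization used in visualize_output (`*50.0+100`) can be checked on its own. The forward step below, `(key_pts - 100) / 50`, is inferred as the inverse of that line and is an assumption about what Normalize() does; the pixel coordinates are made up.

```python
import numpy as np


def normalize_keypoints(key_pts, mean=100.0, scale=50.0):
    # assumed forward step: center around `mean`, then divide by `scale`
    return (key_pts - mean) / scale


def unnormalize_keypoints(key_pts, mean=100.0, scale=50.0):
    # same arithmetic as visualize_output: key_pts * 50.0 + 100
    return key_pts * scale + mean


pixel_pts = np.array([[45.0, 98.0], [150.0, 200.0]])  # made-up (x, y) pixels
normalized = normalize_keypoints(pixel_pts)
restored = unnormalize_keypoints(normalized)
print(normalized)
print(restored)
```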
Training a network to predict keypoints is different than training a network to predict a class; instead of outputting a distribution of classes and using cross entropy loss, you may want to choose a loss function that is suited for regression, which directly compares a predicted value and target value. Read about the various kinds of loss functions (like MSE or L1/SmoothL1 loss) in this documentation.
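To see how these regression losses treat the same errors, here is a small comparison on made-up values with one large outlier. The predictions and targets are hypothetical, and all three criteria use their default mean reduction.

```python
import torch
import torch.nn as nn

# three small errors (0.5) and one large error (4.0)
predicted = torch.tensor([0.0, 0.0, 0.0, 0.0])
target = torch.tensor([0.5, 0.5, 0.5, 4.0])

l1 = nn.L1Loss()(predicted, target).item()
mse = nn.MSELoss()(predicted, target).item()
smooth_l1 = nn.SmoothL1Loss()(predicted, target).item()

# MSE squares the outlier, so it dominates the average;
# L1 and SmoothL1 grow only linearly for large errors
print('L1: {:.4f}, MSE: {:.4f}, SmoothL1: {:.4f}'.format(l1, mse, smooth_l1))
```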
Next, you'll define how the model will train by deciding on the loss function and optimizer.
## TODO: Define the loss and optimization
import torch.optim as optim
import torch.nn as nn
criterion = nn.L1Loss()
optimizer = optim.Adam(net.parameters(), lr=0.0001)
Now, you'll train on your batched training data from train_loader for a number of epochs.
To quickly observe how your model is training and decide whether or not you should modify its structure or hyperparameters, you're encouraged to start off with just one or two epochs. As you train, note how your model's loss behaves over time: does it decrease quickly at first and then slow down? Does it take a while to decrease in the first place? What happens if you change the batch size of your training data or modify your loss function?
Use these initial observations to make changes to your model and decide on the best architecture before you train for many epochs and create a final model.
def train_net(n_epochs):
    # prepare the net for training
    net.train()
    for epoch in range(n_epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        # train on batches of data, assumes you already have train_loader
        for batch_i, data in enumerate(train_loader):
            # get the input images and their corresponding labels
            images = data['image']
            key_pts = data['keypoints']
            # flatten pts
            key_pts = key_pts.view(key_pts.size(0), -1)
            # convert variables to floats for regression loss
            key_pts = key_pts.type(torch.FloatTensor)
            images = images.type(torch.FloatTensor)
            # forward pass to get outputs
            output_pts = net(images)
            # calculate the loss between predicted and target keypoints
            loss = criterion(output_pts, key_pts)
            # zero the parameter (weight) gradients
            optimizer.zero_grad()
            # backward pass to calculate the weight gradients
            loss.backward()
            # update the weights
            optimizer.step()
            # print loss statistics
            running_loss += loss.item()
            if batch_i % 10 == 9:  # print every 10 batches
                print('Epoch: {}, Batch: {}, Avg. Loss: {}'.format(epoch + 1, batch_i+1, running_loss/10))
                running_loss = 0.0
    print('Finished Training')
# train your network
n_epochs = 30 # start small, and increase when you've decided on your model structure and hyperparams
# this is a Workspaces-specific context manager to keep the connection
# alive while training your model, not part of pytorch
with active_session():
    train_net(n_epochs)
See how your model performs on previously unseen test data. We've already loaded and transformed this data in the same way as the training data. Next, run your trained model on these images to see what kind of keypoints are produced. You should be able to see whether your model is fitting each new face it sees, whether the points are distributed randomly, or whether the model has overfitted the training data and does not generalize.
# get a sample of test data again
test_images, test_outputs, gt_pts = net_sample_output()
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.size())
## TODO: visualize your test output
# you can use the same function as before, here with a batch of 20 images:
visualize_output(test_images, test_outputs, gt_pts, 20)
Once you've found a good model (or two), save your model so you can load it and use it later!
Save your models, but please delete any checkpoints and saved models before you submit your project; otherwise your workspace may be too large to submit.
## TODO: change the name to something unique for each new model
model_dir = 'saved_models/'
model_name = 'keypoints_model_1.pt'
# after training, save your model parameters in the dir 'saved_models'
torch.save(net.state_dict(), model_dir+model_name)
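To use the saved parameters later, you create the same architecture and load the state dict into it. The sketch below shows the round trip with a tiny `nn.Linear` standing in for `Net` and a temporary directory standing in for `saved_models/`; both are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for your trained Net

with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, 'keypoints_model_demo.pt')  # hypothetical name
    # save only the parameters, as in the cell above
    torch.save(model.state_dict(), path)
    # later: build the same architecture, then load the saved parameters
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to evaluation mode before making predictions

same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)
```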
After you've trained a well-performing model, answer the following questions so that we have some insight into your training and architecture selection process. Answering all questions is required to pass this project.
I used mean absolute error (MAE) as the loss function because it is less sensitive than MSE to large errors made by the model during training. The facial keypoints (FKPs) estimated with MAE were closer to the ground truth points (GTPs) than those predicted with MSE or even SmoothL1Loss (a Huber-style loss). Some of the FKPs estimated with MSE and SmoothL1Loss appeared to be randomly scattered and did not overlap with the GTPs. I compared the model results and the loss progress for the Huber, MSE, and MAE loss functions: the model trained with MAE performed better than the models trained with the other two, although all three converged in more or less the same way. I chose Adaptive Moment Estimation (Adam) as the optimizer over SGD because it optimized the model faster and more stably.
I started with four convolutional layers and two linear layers. The estimated FKPs tended to resemble certain faces/shapes, and the model could only estimate the FKPs of certain faces in the test dataset, which was a sign that the model was overfitting. Another problem was that the model could not estimate some parts of the face, meaning that it was not capturing the full structure/features of faces. I therefore added one dropout layer between the two linear layers, which randomly switches nodes off during training, to counter overfitting. I added one more convolutional layer to give the model the chance to capture more complex facial features. Furthermore, I increased the number of perceptrons (units) so the model could describe more non-linear distributions of the data, but I did not add any further linear layers, to avoid overfitting. In addition, I added a batch normalization layer to speed up learning and make it more robust. After these modifications the model improved a lot!
At first I set batch_size to 5, but the model did not converge fast enough. I also tested a batch_size of 40, which degraded the model's accuracy on the test set. I found that 20 is quite a good value for batch_size. For every new modification, I first set the number of epochs to three and, if the loss decreased steadily and reasonably, increased it to 30. I did not try a very high number of epochs, to avoid overfitting the model.
Sometimes, neural networks are thought of as a black box: given some input, they learn to produce some output. CNNs are actually learning to recognize a variety of spatial patterns, and you can visualize what each convolutional layer has been trained to recognize by looking at the weights that make up each convolutional kernel and applying those one at a time to a sample image. This technique is called feature visualization and it's useful for understanding the inner workings of a CNN.
In the cell below, you can see how to extract a single filter (by index) from your first convolutional layer. The filter should appear as a grayscale grid.
# Get the weights in the first conv layer, "conv1"
# if necessary, change this to reflect the name of your first conv layer
weights1 = net.conv1.weight.data
w = weights1.numpy()
filter_index = 0
print(w[filter_index][0])
print(w[filter_index][0].shape)
print(w.shape)
# display the filter weights
plt.imshow(w[filter_index][0], cmap='gray')
Each CNN has at least one convolutional layer that is composed of stacked filters (also known as convolutional kernels). As a CNN trains, it learns what weights to include in its convolutional kernels, and when these kernels are applied to some input image, they produce a set of feature maps. So, feature maps are just sets of filtered images; they are the images produced by applying a convolutional kernel to an input image. These maps show us the features that the different layers of the neural network learn to extract. For example, you might imagine a convolutional kernel that detects the vertical edges of a face or another one that detects the corners of eyes. You can see what kind of features each of these kernels detects by applying them to an image. One such example is shown below; from the way it brings out the lines in the image, you might characterize this as an edge detection filter.
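As a minimal sketch of how a kernel produces a feature map, the code below slides a hand-written Sobel kernel (not a learned filter) over a tiny synthetic image with one vertical edge. `filter_valid` is a hypothetical helper written for this illustration; like nn.Conv2d and cv2.filter2D it does not flip the kernel, but unlike filter2D it keeps only positions where the kernel fits entirely inside the image.

```python
import numpy as np


def filter_valid(image, kernel):
    """Slide `kernel` over `image` without flipping it (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # weighted sum of the image patch under the kernel
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


# synthetic 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds to left-to-right changes in brightness
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

feature_map = filter_valid(image, sobel_x)
print(feature_map)
```

The feature map is non-zero only in the columns whose window straddles the edge, which is the behavior you would describe as vertical edge detection.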

Next, choose a test image and filter it with one of the convolutional kernels in your trained CNN; look at the filtered output to get an idea what that particular kernel detects.
## TODO: load in and display any image from the transformed test dataset
## TODO: Using cv's filter2D function,
## apply a specific set of filter weights (like the one displayed above) to the test image
import cv2
test_it = iter(test_loader)
sample_test = next(test_it)
images = sample_test['image'].data.numpy()
weights1 = net.conv1.weight.data.numpy()
weights3 = net.conv3.weight.data.numpy()
weights4 = net.conv4.weight.data.numpy()
image = images[0,0,:,:]
def visu_applied_trained_weight_and_filtered_layer(weights, layer_name, img):
    # two rows per group of 10 filters: the kernels, then the filtered images
    n_rows = int(weights.shape[0] / 10) * 2
    fig, ax = plt.subplots(n_rows, 10, figsize=(40, 40))
    row = 0
    # only full groups of 10 filters are shown
    for i in range(weights.shape[0] - weights.shape[0] % 10):
        ax[row, i % 10].imshow(weights[i, :, :], cmap='gray')
        filtered_img = cv2.filter2D(img, -1, weights[i, :, :])
        ax[row + 1, i % 10].imshow(filtered_img, cmap='gray')
        # move down two rows after every 10th filter
        if (i % 10 == 9 and i != 0):
            row += 2
    fig.suptitle('trained weights of the %s conv layer and the corresponding filtered images' % layer_name, fontsize=16)
    plt.setp(ax, xticks=[], yticks=[])
    plt.show()
visu_applied_trained_weight_and_filtered_layer(weights1[:,0,:,:], 'first', image)
visu_applied_trained_weight_and_filtered_layer(weights3[0:64,1,:,:], 'third', image)
visu_applied_trained_weight_and_filtered_layer(weights4[0:64,3,:,:], 'fourth', image)
I applied the trained filters of the first, third, and fourth convolutional layers of the model. The filters of the first layer appear to extract different edges. The filters of the third and fourth layers appear to capture more specific structures such as arrows and eyes.
Now that you've defined and trained your model (and saved the best model), you are ready to move on to the last notebook, which combines a face detector with your saved model to create a facial keypoint detection system that can predict the keypoints on any face in an image!